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Abstract 

Background: Phylogeny estimation from aligned haplotype sequences has attracted more and more attention in 
the recent years due to its importance in analysis of many fine-scale genetic data. Its application fields range from 
medical research, to drug discovery, to epidemiology, to population dynamics. The literature on molecular 
phylogenetics proposes a number of criteria for selecting a phylogeny from among plausible alternatives. Usually, 
such criteria can be expressed by means of objective functions, and the phylogenies that optimize them are referred 
to as optimal. One of the most important estimation criteria is the parsimony which states that the optimal phylogeny 
7* for a set T-L of n haplotype sequences over a common set of variable loci is the one that satisfies the following 
requirements: (i) it has the shortest length and (ii) it is such that, for each pair of distinct haplotypes hj, hj e T-L, the sum 
of the edge weights belonging to the path from hj to hj in 7* is not smaller than the observed number of changes 
between hj and hj. Finding the most parsimonious phylogeny for H involves solving an optimization problem, called 
the Most Parsimonious Phylogeny Estimation Problem (MPPEP), which is ATP-hard in many of its versions. 

Results: In this article we investigate a recent version of the MPPEP that arises when input data consist of single 
nucleotide polymorphism haplotypes extracted from a population of individuals on a common genomic region. 
Specifically, we explore the prospects for improving on the implicit enumeration strategy of implicit enumeration 
strategy used in previous work using a novel problem formulation and a series of strengthening valid inequalities and 
preliminary symmetry breaking constraints to more precisely bound the solution space and accelerate implicit 
enumeration of possible optimal phylogenies. We present the basic formulation and then introduce a series of 
provable valid constraints to reduce the solution space. We then prove that these constraints can often lead to 
significant reductions in the gap between the optimal solution and its non-integral linear programming bound 
relative to the prior art as well as often substantially faster processing of moderately hard problem instances. 

Conclusion: We provide an indication of the conditions under which such an optimal enumeration approach is likely 
to be feasible, suggesting that these strategies are usable for relatively large numbers of taxa, although with stricter 
limits on numbers of variable sites. The work thus provides methodology suitable for provably optimal solution of 
some harder instances that resist all prior approaches. 
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Background 

Molecular phylogenetics studies the hierarchical evolu- 
tionary relationships among species, or taxa^ by means of 
molecular data such as DNA, RNA, amino acid or codon 
sequences. These relationships are usually described 
through a weighted tree, called a phylogeny, whose leaves 
represent the observed taxa, internal vertices represent 
the intermediate ancestors, edges represent the estimated 
evolutionary relationships, and edge weights represent 
measures of the similarity between pairs of taxa. 

Accurately characterizing evolutionary relationships 
between organisms and their genomes is the basis of com- 
parative genomic methods for interpreting meaning in 
sequence data, and for this reason the use of molecular 
phylogenetics has become widely used (and sometimes 
indispensable) in a multitude of research fields such as 
systematics, medical research, drug discovery, epidemi- 
ology, and population dynamics [3]. For example, the 
use of molecular phylogenetics was of considerable assis- 
tance in predicting the evolution of human influenza A 
[4], understanding the relationships between the viru- 
lence and the genetic evolution of HIV [5,6], identifying 
emerging viruses as SARS [7], recreating and investigating 
ancestral proteins [8], designing neuropeptides causing 
smooth muscle contraction [9], and relating geographic 
patterns to macroevolutionary processes [10]. 

The literature on molecular phylogenetics proposes a 
number of criteria for selecting the phylogeny of a set 1-L 
of haplotypes extracted from n taxa from among plau- 
sible alternatives. The criteria can usually be quantified 
and expressed in terms of objective functions, giving 
rise to families of optimization problems whose general 
paradigm can be stated as follows [11]: 

Problem 1. - The Phylogeny Estimation Problem (PEP) 
optimize f{T) 

s,t, g(n.T) = i 

where T a phylogeny of T the set of all possible phylo- 
genies of H,/ : T ^ Ma function modeling the selected 
criterion of phylogeny estimation, and g : H x T ^ ^ 
is a characteristic function equal to one if a phylogeny 
T is compatible (according to the selected criterion) for 
the set T-L, A specific optimization problem is completely 
characterized by defining the functions / and g, and the 
phylogeny T* that optimizes/ and satisfies g is referred to 
as optimal 

One of the classic criteria for phylogeny estimation is 
the parsimony criterion, which assumes that one taxon 
evolves from another by means of small changes and 
that the most plausible phylogeny will be that requiring 
the smallest number of changes. That evolution proceeds 



by small rather than smallest changes is due to the fact 
that the neighborhood of possible alleles that are selected 
at each instant of the life of a taxon is finite and, per- 
haps more important, that the selective forces acting on 
the taxon may not be constant throughout its evolution 
[12,13]. Over the long term (periods of environmen- 
tal change, including the intracellular environment), the 
accumulation of small changes will not generally corre- 
spond to the smallest possible set of changes consistent 
with the observed final sequences. Nevertheless, it is 
plausible to believe, at least for well-conserved molec- 
ular regions where mutations are reasonably rare and 
unlikely to have occurred repeatedly at any given variant 
locus, that the process of approximating small changes 
with smallest changes could provide a good approxima- 
tion to the true evolutionary process of the observed set 
of taxa [14]. Such an assumption is likely to be reason- 
able, for example, in intraspecies phylogenetics, where few 
generations have elapsed since the observed taxa shared 
a common ancestor and thus the expected number of 
mutations per locus is much less than one. When such 
assumtions hold, a phylogeny of 1-L is defined to be optimal 
under the parsimony criterion if it satisfies the following 
requirements: (i) it has the shortest length, i.e., the mini- 
mum sum of the edge weights, and (ii) it is such that, for 
each pair of distinct haplotypes /z/, hj e the sum of the 
edge weights belonging to the path from hi to hj in T* is 
not smaller than the observed number of changes between 
hi and hj [11]. The first condition imposes the assump- 
tion that the smallest number of mutations consistent with 
the observed sequences is a good approximation to the 
true accumulated set of mutations; the second condition 
correlates the edge weights to the observed data. 

The parsimony assumption can be considered accu- 
rate in the limit of low mutation rates or short time 
scales, making it a reasonable model for situations such as 
analysis of intraspecies variation where little time is pre- 
sumed to have elapsed since the existence of a common 
ancestor of all observed taxa. Maximum parsimony also 
remains valuable as a model for novel methodology devel- 
opment in phylogenetics because of its relative simplicity 
and amenity to theoretical analysis. Novel computational 
strategies, such as those developed in this paper, might 
therefore productively be developed and analyzed in the 
context of maximum parsimony before being extended to 
more complicated models of phylogenetics. 

Finding the phylogeny that satisfies the parsimony cri- 
terion involves solving a specific version of the PEP, 
called the Most Parsimonious Phylogeny Estimation Prob- 
lem (MPPEP). Some of the variants of the MPPEP, see 
e.g., [15,16], can be solved in polynomial time, however, 
in the most general case, the problem is ATP-hard 
[11,17] and this fact has justified the development of a 
number of exact and approximate solution approaches. 
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such those described in [11,17,18]. Some recent ver- 
sions of the MPPEP, such as the Most Parsimonious 
Phylogeny Estimation Problem from SNP haplotypes 
(MPPEP-SNP) investigated in this article, play a funda- 
mental role in providing predictions of practical use in 
several medical bioinformatics applications, such as dis- 
ease association studies [19] or reconstruction of tumor 
phylogenies [20,21]. In these contexts, it would be highly 
desirable to have the most accurate inferences possible 
for specific applications, but this in turn would imply 
to have algorithms able to exactly solve instances of 
such versions. As regards the MPPEP-SNP, the literature 
describes some (rare) circumstances for which it is possi- 
ble to solve the problem in polynomial time (see Section 
Methods). Unfortunately, in the general case the MPPEP- 
SNP is ATT^-hard and solving provably to optimality there- 
fore generally requires the use of exact approaches based 
on implicit enumeration algorithms, similar to the mixed 
integer programming strategies described in [1,2,22]. 

In this article, we explore the prospects for improv- 
ing on the implicit enumeration strategy of [1,2] using a 
novel problem formulation and a series of additional con- 
straints to more precisely bound the solution space and 
accelerate implicit enumeration of possible optimal phylo- 
genies. We present a formulation for the problem based on 
an adaptation of [23] s mixed integer formulation for the 
Steiner tree problem extended with a number of prepro- 
cessing techniques and reduction rules to further decrease 
its size. We then show that it is possible to exploit the 
high symmetry inherent in most instances of the prob- 
lem, through a series of strengthening valid inequalities, 
to eUminate redundant solutions and reduce the practical 
search space. We demonstrate through a series of empiri- 
cal tests on real and artificial data that these novel insights 
into the symmetry of the problem often leads to signifi- 
cant reductions in the gap between the optimal solution 
and its non-integral linear programming bound relative to 
the prior art as well as often substantially faster process- 
ing of moderately hard problem instances. More generally, 
the work provides an indication of the conditions under 
which such an optimal enumeration approach is likely to 
be feasible, suggesting that these strategies are usable for 
relatively large numbers of taxa, although with stricter 
limits on numbers of variable sites. The work thus pro- 
vides methodology suitable for provably optimal solution 
of some harder instances that resist all prior approaches. 
In future work, it may provide useful guidance for strate- 
gies and prospects of similar optimization methods for 
other variants of phylogeny inference. 

Methods 

Notation and problem formulation 

Before proceeding, we shall introduce some notation and 
definitions that will prove useful throughout the article. 



The human genome is divided in 23 pairs of chromo- 
somes, i.e., organized structures of DNA that contain 
many genes, regulatory elements and other nucleotide 
sequences. When a nucleotide site of a specific chromo- 
some region shows a variability within a population of 
individuals then it is called a Single Nucleotide Polymor- 
phism (SNP). Specifically, a site is considered a SNP if 
for a minority of the population a certain nucleotide is 
observed (called the minor allele) while for the rest of the 
population a different nucleotide is observed (the major 
allele). The minor allele, or mutant type [24], is generally 
encoded as '1', as opposed to the major allele, or wild 
type [24], generally encoded as '0! A haplotype is a set 
of alleles, or more formally, a string of length m over an 
alphabet!] = {0,1} [25]. 

Given a set H ofn haplotypes, denote S = {1, 2, . . . , m} 
as the set of alleles and /z/(5), 5 g 5, as the 5-th allele of hap- 
lotype hi e %. Given two distinct haplotypes /z/, hj G 
we denote Sh^hj the subset of different alleles between 
hi and hj, dp^k. = J^seS^.h- ~ ^J^^^\ distance 
between hi and /zy, and we say that hi and hj are adjacent 
if dkihj = 1- From a biological point of view, the adjacency 
between a pair of distinct haplotypes means that one of 
the two haplotypes evolved from the other by mutation of 
a specific SNP over time. 

Consider a graph G = (H, E) having a vertex for each 
haplotype in % and an edge for each pair of adjacent hap- 
lotypes hi, hj G T-L. Then, a phylogeny T of is a spanning 
tree of G, i.e., an acyclic subgraph of G in which a pair of 
vertices hi.hj G T-L is adjacent in T if d^-h. = 1 . It is worth 
noting that, according to the definition of the edge set E, 
in general a phylogeny of H may not exist as the graph 
G = (HyE) might not be connected. To always ensure 
the existence of a phylogeny for we introduce the set 
which consists of the minimum number of haplotypes 
that should be added to % in such a way that, defined 
n = nuW2indE= {(hi, hj) : hi, hj e % and dh.hj = Ih 
the graph G = (H,E) is connected. From a biological 
point of view, the set W represents the set of haplotypes 
that are ancestors of the observed ones but either had 
gone extinct or just were not observed in that sample (also 
called Steiner nodes). 

Denote T as a phylogeny of H, E(T) as the edge set 
of T, and L(T) as the length of the phylogeny T, i.e., the 
sum of the distances dpiihj> for all (hi,hj) G E(T). Then, 
the problem of finding a phylogeny of Ti that satisfies 
the parsimony criterion consists of solving the following 
optimization problem: 

Problem 2. The Most Parsimonious Phylogeny Estima- 
tion Problem from SNP haplotypes (MPPEP-SNP), 
Given a set Tiofn haplotypes having m alleles each, find 
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the minimum cardinality haplotype set li! to he added to 
% so that the phylogeny T has minimum length. 



If the haplotype set T-L' is empty, i.e., if G = fH, E) is 
connected, then MPPEP-SNP can be solved in polynomial 
time as the minimum spanning tree is a (optimal) solution 
to the MPPEP-SNP. Similarly, if the input haplotype set H 
satisfies the perfect phylogeny condition i.e., the require- 
ment that each allele changes only once throughout the 
optimal phylogeny (see [19]), then the MPPEP-SNP can be 
still solved in polynomial time [26-28]. Unfortunately, it is 
possible to prove that in the general case the MPPEP-SNP 
is ATP-hard (see [1,22]). In fact, the binary nature of the 
SNP haplotypes allows us to interpret a generic haplotype 
hi G 'H as a vertex of a m-dimensional unit hypercube, its 
5-th allele as the 5-th coordinate of the vertex hu and the 
set %' as the set of Steiner vertices of the unit hypercube. 
Hence the MPPEP-SNP can be seen as particular case of 
the Steiner tree problem in a graph, a notorious ATP-hard 
combinatorial optimization problem [29] . 

Finding the optimal solutions to the MPPEP-SNP is fun- 
damental to validating the parsimony criterion, i.e., to 
verify whether, for a given instance of MPPEP-SNP, the 
results predicted by the criterion match the experimental 
ones. Unfortunately, the ATP-hardness of the MPPEP- 
SNP limits the size of the instances analyzable to the 
optimum, which in turn affects the ability to validate the 
parsimony criterion, hence the practical utility of the pre- 
dictions themselves. In order to address this concern, in 
the following section we shall develop an integer pro- 
gramming model able to provide optimal solutions to real 
instances of the MPPEP-SNP. 



A mixed integer programming model for the MPPEP-SNP 

Let F = {l,2,...,w,w + l,w + 2,...,fz + \U'\} the set 
of potential vertices of a phylogeny T of H and assume 
the convention to denote the n haplotypes in H as the 
first n vertices in V. The first task in designing an inte- 
ger programming model for the MPPEP-SNP that uses a 
polynomial-size number of variables consists of charac- 
terizing V, i.e., determining an upper and a lower bound 
on the cardinality of the set T-L\ In fact, observe that W 
contains potentially an exponential number of haplotypes, 
namely all vertices of the unit hypercube that belong to the 
set {0, 1}^ \T-L. However, we can easily bound the cardi- 
nality of by means of the following approach. Consider 
the complete graph G = (T-L,E), where E = {(hi,hj) : 
hi, hj e %}, and construct a minimum spanning tree of 
G. Denote E(T^) as the set of edges (hi.hj) of T^, Then, 
an upper bound UB on the overall number of Steiner 
vertices of the optimal phylogeny T can be obtained by 
computing the sum 



1). 



{hi,hj)eE{T^) 

Similarly, denote L{T^) = T.{hi,hj)eE{T^)^hihj> a lower 

bound LB on the overall number of Steiner vertices of T 
can be obtained as [30,31]: 



LB = 



- w + 1 



Denote Ui, i e V^, as a decision variable equal to 1 if the 
i-th vertex of V is considered in the optimal solution to 
the MPEPP-SNP and 0 otherwise; x^- as a decision variable 
equal to 1 if in the optimal solution to the MPPEP-SNP 
the 5-th coordinate of the vertex Ui, i e V, is 1 and 0 
otherwise; z^-j as a decision variable equal to 1 if in the 
optimal solution to the MPPEP-SNP the pair of distinct 
vertices e V has a change at 5-th coordinate, and 0 
otherwise; and yij as a decision variable equal to 1 if the 
pair of distinct vertices /,/ e V is adjacent in the optimal 
solution to the MPPEP-SNP and 0 otherwise. Finally, let 
y-H = {1, 2, . . . , n}, V^/ = + 1, n + 2, . . . , n + UB}, and 
Q = {1, 2, ...,« + LB}. Then, a valid formulation for the 
MPPEP-SNP is the following: 



Formulation 1. Basic Model 

-in E E4 

i,jeV:i^j seS 

s.t. x\ = hi(s) 



(la) 



4 ^ 



WseSJen (lb) 
yseSJeV (Ic) 
4 > -x]+yij-l yseS, ij eVii^j 

(Id) 

4 > -x' -\-xfj +yij - 1 V5 G S,i,j eV'A^j 

(le) 

Y.Aj=yij WiJeV:i^J (If) 

seS 

yij<Ui ^i,jeV:ii^j (Ig) 

ytjSuj WiJeV'.iy^j (Ih) 

i,jeC:i^j ieC 

(Ij) 

QeVd^j ieV 

Y,Ui = n-\-LB (11) 

ieQ 



Ui, x], z'^j^yij e {0,1}. 



(Im) 



The objective function (la) aims at minimizing the 
length of the optimal phylogeny. Constraints (lb) impose 
that the coordinates of the first n vertices in V are exactly 
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the ones of the input haplotype set Constraints (Ic) 
impose that the 5-th coordinate of vertex Ui, i g V, can 
assume value 1 only if vertex Ui is considered in the opti- 
mal solution to the problem. Constraints (Id)-(le) force 
variables z]^ to be one if in the optimal solution to the 
problem there exist a pair of adjacent vertices /,/ G V hav- 
ing a different value at the 5-th coordinate. Constraints (If) 
impose that in an optimal solution to the problem two dis- 
tinct vertices g V can be adjacent only if d^^hj = 1- 
Constraints (Ig)-(lh) impose that in the optimal solution 
to the problem variable yij may assume value 1 only if both 
vertices / and / are considered. Vice versa, constraints (li) 
impose that if in the optimal solution to the problem a 
vertex w/, i g V, is considered then at least one variable 
yij must assume value 1. Constraints (Ij) and (Ik) impose 
the Generalized Subtour Elimination Constraints (GSEC) 
[23]. Finally, constraints (11) impose that the first n -\- LB 
vertices in V have to be considered in the optimal solution 
to the problem. 

Note that Formulation 1 can be easily extended to the 
case in which the haplotypes are represented by multi- 
character data, i.e., sequences over an alphabet H = 
{0, 1,2, . . . , }/}. In fact, in such a case it is sufficient to 
replace each character c in the haplotype by a binary y - 
vector V such that the 5-th coordinate of v is equal to 1 
if the character c is equal to 5, 5 G E, and 0 otherwise. 
For example, if the generic haplotype were, for exam- 
ple, the string (AACGT), then it could be represented as 
(1000 1000 0100 0010 0001). 

Reducing model size 

Formulation 1 is characterized by a polynomial number 
of variables and an exponential number of constraints. Its 
implementation can be performed by means of standard 
branch-and-cut approaches that use GSEC separation 
oracles such as those described in [32]. 

It is worth noting that variables x^- and z^-j can be relaxed 
in Formulation 1 as constraints (lc)-(le) and the convex- 
ity constraint (If) suffice to guarantee their integrality in 
any optimal solution to the problem. Moreover, Formula- 
tion 1 can be reduced in size by removing those variables 
that are redundant or whose value is known in the opti- 
mal solution to the problem. For example, variables yij can 
be removed from Formulation 1 as it is easy to realize that 
they are redundant. Similarly, all variables z^-j such that 
i,j G V-u and dij > 1 do not need to be defined as their 
value will be always zero for any 5 G <S and in any feasible 
solution to the problem. Variables Ui, i e Q, do not need to 
be declared as their value will be always 1 any feasible solu- 
tion to the problem. Finally, variables xf-, i e V-u, can be 
removed as their value is univocally assigned by the input 
haplotype set T-L. The reduction process can be further 
combined with the preprocessing strategies described in 
[1] to obtain even smaller formulations. Such strategies 



allow one to remove alleles from the input haplotype set 
H without altering the optimal solution to the problem. 
For example, suppose that the haplotype set T-L is such that 
there exists an allele 5 g 5 such that hi(s) = 1, for all 
hi G H; then it is easy to realize that 5 can be removed 
from S as in any feasible solution to the problem the 5-th 
coordinate of any vertex in the phylogeny would be char- 
acterized by having xf- = I. A similar situation occurs 
when there exists an allele s e S such that hi(s) = 0, for all 
hi el-L, Analogously, suppose that the input haplotype set 
T-L is characterized by equal alleles^ i.e., by the existence of 
two alleles, say Si and 52, such that hi(si) = hi(s2)f for all 
/ G 5. Then it is easy to realize that if one removes one 
of the two alleles from S, say 52, and multiplies the 5i-th 
coordinate by 2 does not alter neither the structure nor the 
value of the optimal solution to the problem. Describing 
all the preprocessing techniques for shrinking the input 
haplotype set T-L is beyond the scope of the present article. 
The interested reader will find a systematic discussion of 
this topic in [1]. 

By applying the previously cited reduction strategies to 
Formulation 1 and denoting S as the set of alleles result- 
ing from the application of the preprocessing strategies 
described in [1], vi/ as the number of alleles in S equal 
to the 5-th, 5 G 5, Z as the set for which variables z^-j are 
defined, R = {n LB -\- l.n -\- LB -\- 2, . . . ,n -\- UB}, and 
Cji = {/ G C : / G Vu}y foi" ^ny C C V^, the following 
reduced formulation for the MPPEP-SNP can be obtained: 

Formulation 2, Reduced Model 



lin Y.'^^i) 



i,JeZ 
S.t. x] < Ui 

s'eS: 



(2a) 

"iseSJeR (2b) 

V 5 G <S, / G V-U^j G Vu' 



(2c) 



^ 4 - hi{s)+^j < 1 V5 G 5, / G Vu.j e Vw 



s'eS: 



^ 4+4-^<i 



s'eS: 



s'eS: 



(2d) 

V5 G 5, G Vn' : iJ G Z 

(2e) 

V s G <S, G Vn' ' hj G Z 

(2f) 

V Uj eV\R:iJeZ (2g) 
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seS 



S€S 



V i € R,j € V : i,j € Z (2h) 
Vj eR.ieV -.ij eZ (2i) 
VieQ (2j) 



Vie/? 



(2k) 



i,jeC: ieC'.ieR 
i,JeZ 

vccy:Cnv^7^0 (21) 

i,JeZ 



uu 4, 4pyij e {0, 1}. 



(2n) 



Note that in Formulation 2 variables and z^y cannot be 
relaxed anymore. 



Strengthening valid inequalities 

By exploiting the integrality of variables w/, x^^y and z^^j, 
a number of valid inequalities can be developed to 
strengthen Formulation 2. 



Proof. In a feasible solution to the problem a vertex ui, 
i G Vu'y cannot be a terminal vertex. In fact, if such a 
condition held, a cheaper solution could be obtained by 
dropping Ui from T , contradicting the optimality of T 
itself. Hence, the degree of any vertex in 1/^/ must be at 
least 2. Now, in a feasible solution to the problem variables 
Ui G {0, 1}. If Ui = 0, constraint (4) reduces to 



jeZ 

which is trivially valid. Vice versa, if Ui = 1, constraint (4) 
reduces to 

EE4^2 

J^y- seS 
JeZ 

which is again valid for the above arguments. □ 
Proposition 3, Constraints 

y si,S2 e S : Si ^ 52, e Vu' : Uj e Z (5) 



Proposition !♦ Constraints 



Ui+i <Ui V / G F \ (Q U {fz + UB}) 



(3) 



are valid for Formulation 2, 



Proof In a feasible solution to the problem variable 
Uif i G V \ {Q \J {n -\- UB}), can assume only 
value 0 or 1. If Ui = 0, constraint (3) reduces to 
< 0 which is trivially valid for Formulation 2. If 
Ui = 1, constraint (3) reduces to m^+i < 1 which is 
again valid. □ 

Constraints (3) impose an ordering on the variables Ui, 
i G R, so that vertex Ui-\-i can be considered in the optimal 
solution to the problem only if vertex Ui has been already 
considered. 



Proposition 2» Constraints 



EE4^2«' ^^'^^ 



(4) 



jeZ 



are valid for Formulation 2. 



E 

seS: s^si 



4 



V5i,52 G 5 : 51 7^ 52, /,/ G Vu' - ij ^ Z 

are valid for Formulation 2. 



(6) 



Proof As observed in the previous proposition, in a fea- 
sible solution to the problem ^ ^Jy, G V^/, /,/ G Z, can 

seS 

assume only value 0 or 1. If ^ = 0, then constraint 

(5) (respectively constraint (6)) reduces to -\-x^^ — < 2 
(respectively —x?' -\-x^ < 2), which is trivially valid due to 
the integrality of variables x^-. If X^^e^ — 1> then either 
zjy = 1 or z^j = 1. If X! — 1 th^^ constraint 



565: 



(5), (respectively constraint (6)) reduces to +xf — < 1 
(respectively —xf -\- x^ < 1), which is trivially valid. If 
z^j = 1 then constraint (5) (respectively constraint (6)) 
reduces to -\-xf' — x^^ < 0 (respectively —xf + x^^ < 0), 
which is again valid. ~ 



□ 



Similar arguments can be used to prove the following 
proposition: 
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Proposition 4. Constraints 



Proposition 5. Constraints 



seS: 



V5i,52 e S :si y^S2, i e Vn, j e Vu' 



-hi{s2)+xf <2{l-z'})-Y,Aj 

se<S: 

Vsi,S2 e 5 : Si S2, / G Vu, j e Vu' 



(7) 



(8) 



Vs e 5,? e \7? (9) 

p=l p—1 p=l 

yseSjeR\{n + UB} (10) 
are valid for Formulation 2, 



are valid for Formulation 2. 



Given an input haplotype set 1-L and a pair of non- 
adjacent haplotypes hi and hj, there exit multiple equiv- 
alent paths that may connect hi and hj in the unary 
hypercube. This characteristic constitutes the principal 
class of symmetries for the MPPEP-SNP and may lead to 
poor relaxations for the problem. For example, suppose 
that the input haplotype set 1-L is constituted by haplotypes 
hi = (00) and /z2 = (11). Then a possible solution to the 
instance may consist either of a star centered in haplotype 
/?3 = (10) or a star centered in haplotype h^ = (01) (see 
Figure 1). Note that both solutions are feasible and opti- 
mal for the specific instance. A possible strategy to break 
this class of symmetries consists of imposing the following 
constraints: 



10 



11 



0 



I 

o 



00 



01 



Figure 1 An example of two symmetric paths linlcing haplotypes 
(00) and (11) . 



Proof The statement trivially follows from the integral- 
ity of variables x^- and from constraints (2b). □ 



Constraints (9)-(10) impose an ordering on the coor- 
dinates of the vertices in V^f by means of the smallest 
big-M possible, i.e., a power of 2. Note that the distinction 
between constraints (9) and (10) is necessary, as in princi- 
ple vertices in R may not be needed in the optimal solution 
to the problem. 



Proposition 6. Constraints 



EE^iJ^H E4+1); ^ ^ ^ ^n' \ [n + UB} 

(11) 



jeZ jeZ 



are valid for Formulation 2. 

Proof In a feasible solution to the problem, the sum 
^ z\p ij G V^/, G Z, can assume only value 0 or 1. If 

seS 

J2 ~ ^' constraint (11) reduces to ^ ^ z^- > 

jeZ jeZ 

0 which is trivially valid. Vice versa. If ^ ^ ~ 

jeZ 

constraint (11) reduces to J2 — 1 which is again 

jeZ 

valid due to Propositions 1 and 2. □ 



mg 
to 



Proposition 6 forces vertices in V^f to be sorted accord- 
to a decreasing degree order. In this way, it is possible 



avoid the occurrence of symmetric solutions to the 
problem differing just for the degree of the Steiner vertices 
(see e.g.. Figure 2). 

Let Qs = {/,; e Vu : ^ > 3} and k eVJ< ^ Q3. Then 
the following proposition holds: 
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Figure 2 An example of two symmetric solution to the MPPEP-SNP. 



Proposition 7. Constraints 

E4 + E4;^i ^''/^Q3 (12) 

seS seS 
are valid for Formulation 2. 

Proof, In a feasible solution to the problem the path 
between two distinct haplotypes huhj e 1-L cannot be 
shorter than the distance dh^hj- Hence, if the distance 
between hi and hj is greater or equal to three, vertices / and 
/ cannot be adjacent to a same vertex /c, i.e., only one of the 
two sums ^seS ^seS equal to 1. □ 

Note that if /c g R then (12) can be strengthened by 
replacing the right-hand-side by u^. Moreover, Proposi- 
tion 7 can be generalized as follows. Consider the sets 
Qd = {/,; G Vn : dij >d},de {3,4,. C C V such 
that 2 < |C| < ^ - 1 and C n = 0, and a path p that 
involves only vertices in C. Denote the /c-th vertex in p. 
Then the following proposition holds: 

Proposition 8. The family of constraints 

\c\-i 

seS k=l seS seS 

(13) 

called forbidden path constraints, are valid for Formula- 
tion 2. 

Proof Similarly to Proposition 7, in a feasible solution 
to the problem the path p between two distinct haplo- 
types hi, hj G 1-L cannot be shorter than the distance dk^hj- 
Hence, if the distance between hi and hj is greater or equal 
to d, at most | C| vertices can belong to p, □ 

Experiments 

In this section we analyze the performance of our model to 
solve the MPPEP-SNP. Our experiments were motivated 
by a twofold reason, namely: to evaluate, with respect to 
Formulation 1, the benefits obtained by the removal of 
the redundant variables and by the inclusion of the valid 



inequalities previously described; and to allow the analy- 
sis of larger datasets with respect to the ones analyzable by 
means of [1] s algorithm, currently the best known exact 
approach to solution of the MPPEP-SNP. 

Similar to [1], we emphasize that the experiments aim 
simply to evaluate the runtime performance of our model 
for solving MPPEP-SNP. We neither attempt to study the 
efficiency of MPPEP-SNP for phylogeny estimation nor 
compare the accuracy of our algorithm to phylogeny esti- 
mation solvers that do not use the parsimony criterion. 
The reader interested in a systematic discussion about 
such issues is referred to [19,33]. 

Implementation 

We implemented Formulations 1 and 2 by means of 
Mosel 64 bit 3.2.0 of Xpress-MP, Optimizer version 22, 
running on a Pentium 4, 3.2 GHz, equipped with 2 GByte 
RAM and operating system Gentoo release 7 (kernel linux 
2.6.17). In both formulations, we computed the overall 
solution time to solve a generic instance of the problem 
as the sum of the preprocessing time due to the imple- 
mentation of [22] s reduction rules plus the solution time 
taken by the Optimizer to find the optimal solution to the 
instance. In preliminary experiments, we observed that 
Formulation 2 has two main advantages with respect to 
Formulation 1, namely: it requires much less memory to 
load the model (at least 27% RAM less in the analyzed 
instances) and it is characterized by faster linear relax- 
ations at each node of the search tree. As result. Formu- 
lation 2 allows potentially the analysis of larger instances 
than Formulation 1 and may be characterized by faster 
solution times. Hence, we preferred to use Formulation 2 
in our experiments. 

We considered two different implementations of For- 
mulation 2, namely: the GESC-based Reduced Model 
(GSEC-RM) and the Flow-based Reduced Model (Flow- 
RM). The GESC-RM consists of Formulation 2 strength- 
ened by the valid inequalities previously described. The 
Flow-RM consists of Formulation 2 strengthened by the 
valid inequalities and such that the GSEC are replaced 
by a multi-commodity flows. Specifically, by denoting 
as a decision variable equal to one if there exists a flow 
from vertex 1 to vertex q e passing through edge 
(/,;) G E and 0 otherwise, the Flow-RM can be obtained 
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by replacing constraints (21) with: 

fn^fl^H^ii V^eVw:^^!, i,j&V:i,j&Z 

seS 

(14) 



J2 f^i = '^ yqeVn:q¥=i 



(15) 



(16) 



E 4- E 4 = 1 V,6V«:^^1 



(17) 



(18) 



In preliminary experiments we observed that the Flow- 
RM outperforms the GESC-RM in terms of solution time. 
This fact is mainly due to the computational overhead 
introduced by the GSEC separation oracle which seems to 
be not compensated by the size of the analyzed instances. 
Hence, we did not consider the GESC-RM any further in 
our experiments. 

During the runtime, we enabled the Xpress-MP auto- 
matic cuts and the Xpress-MP pre-solving strategy. More- 
over, we also tested a number of generic primal heuristics 
for the Steiner tree problem to generate a first primal 
bound to the MPPEP-SNP (see, e.g., [34]). Unfortunately, 
in preliminary experiments we observed that the use 
of such heuristics interferes negatively with the Xpress 
Optimizer, by delaying the solution time of the analyzed 
instances. Hence, we disabled the used of the generic 
primal heuristics and enabled the use of the Xpress-MP 
primal heuristic instead. The source code of the algo- 
rithm can be downloaded at http://homepages.ulb.ac.be/~ 
dacatanz/Site/Software_files/iMPPEP.zip. 

Separation oracle for the forbidden path constraints 

When using the Flow-RM, the valid inequalities (3) -(12) 
are loaded together with the reduced model. On the con- 
trary, the valid inequalities (13) are dynamically generated 
by means of a separation oracle working as follows. Before 
loading the reduced model, we precompute the sets Qj, 
for all d e {3, 4, . . . , m}. Let (w, z) be the current frac- 
tional solution at a given node of the search tree and, for 
all d e {3, 4, ... , m}, consider a pair of vertices g Qj. 
Then, the forbidden path constraints (13) are violated if 
there exists a path p having internal vertices in C C Vy 
2<\C\<d-hCnQd = 0, and such that 

ici-i 

E^W + E E^^.mi + E^^ic; > 1^1- (19) 

seS k=l seS seS 



Note that searching for the most violated constraint (19) 
is in general ATT^-hard as it involves solving a longest path 
problem on the weighted graph gJ^^"^, i.e., the graph G 
whose edges are weighted by variables z and whose vertex 
set is restricted to (1/ \ Q^) U {/,/}. In order to have a fast 
separation oracle for the forbidden path constraints we 
do not solve exactly (19) but we use a heuristic approach 
instead. Specifically, we first sort edges of E in decreasing 
order according to their weights and we select two dis- 
tinct vertices vi,V2 e V \ Qd such that edge (vi,V2) has 
the largest weight. Subsequently, we set C = {vi, V2}, mark 
vi and V2 as visited, and build a simple path from vertex 
/ to vertex ; passing by vi and V2. If p is such that (19) is 
satisfied then we add the constraint 
ici-i 

E4i + E E4.Pm + E4,c,; ^ i^i (^o) 



seS 



k=l seS 



seS 



to the formulation; otherwise, we select a different pair of 
vertices in V \ and iterate this procedure until either 
10 distinct paths have been generated or all possible pairs 
of vertices in F \ have been selected. If all vertices have 
been selected but less than 10 distinct paths have been 
generated, then we select a larger subset of F \ (e.g., a 
triplet of vertices) and we iterate again the previous steps. 
It is easy to realize that this procedure may potentially 
generate all the possible paths violating (13). However, 
we stop the procedure after generating 10 paths or after 
considering subset C containing more than 5 vertices as 
we observed in preliminary experiments that this strategy 
provides the best trade-off between speed and tightness of 
the cut. 

Branching strategies 

In preliminary experiments we observed that the stan- 
dard branching strategy implemented in the Xpress-MP 
Optimizer is not appropriate for the problem as it is not 
able to exploit the dissimilarity of the weights in the 
objective function. This inconveniently leads to formula- 
tions characterized by slow solution times. To improve 
this aspect we implemented a different strategy consisting 
of branching on the following constraints: 



E 

i,jeV: 
i,jeZ 



or 



E 

UjeV: 
hjeZ 



4 > 



V5G<S 



yseS 



(21) 



(22) 



where a G {1,2, ...,q} and q = mm{J2keVn n/2}. 
Constraints (21)-(22) limit the number of changes along a 
phylogeny with respect to a given coordinate s e S and 
tend to be more effective when the weights are very 
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dissimilar among them. This branching strategy can be 
implemented by introducing a decision variable 



Pa = 



1 if the overall number of changes at coordinate 

s e S oiT is equal to a 
0 otherwise, 



for all 5 G S and a e {1,2, ... ,q}, and by adding the 
following constraints 



UjeZ 



V5 G S, 



We observed that even better runtime performance can 
be obtained by sorting the coordinates of the input haplo- 
types in decreasing way according to the weights and by 
branching first on variables then on variables uu and 
subsequently on variables x] and finally on variables z^y. 

Performance analysis 

In order to obtain a measure of the performance of 
the Flow-RM, we compared [l]s polynomial-size for- 
mulation versus the Flow-RM on [1] s real instances of 
the MPPEP-SNP, namely: Human chromosome Y con- 
stituted by 150 haplotypes having 49 SNPs each; bac- 
terial DNA constituted by 17 haplotypes having 1510 
SNPs each; Chimpanzee mitochondrial DNA constituted 
by 24 haplotypes having 1041 SNPs each; Chimpanzee 
chromosome Y constituted by 24 haplotypes having 
1041 SNPs each; and a set of four Human mitochon- 
drial DNA from HapMap [35] constituted by 40 haplo- 
types having 52 SNPs each, 395 haplotypes having 830 
SNPs each, 13 haplotypes having 390 SNPs each, and 
44 haplotypes having 405 SNPs each, respectively. Such 
instances consist only of non-recombining data (Y chro- 
mosome, mitochondrial, and bacterial DNA) and can 



be downloaded at http://homepages.ulb.ac.be/'^dacatanz/ 
Site/Software_files/iMPPERzip. 

Table 1 shows the results obtained by such compari- 
son. Specifically, the fourth and fifth columns refer to the 
gaps (expressed in percentage) of the respective formu- 
lations, i.e., to the difference between the optimal value 
to a specific instance and the value of linear relaxation 
at the root node of the search tree, divided by the opti- 
mal value. The table shows that, excluding the cases in 
which the solution to a specific instance was trivially a 
minimum spanning tree (see e.g.. Human chromosome Y, 
Chimpanzee mtDNA, and Chimpanzee chromosome Y), 
the Flow-RM is always characterized by (sometimes dra- 
matically) smaller gaps. This fact derives on the one 
hand from the tightness of the Flow-RM with respect to 
[1] s polynomial-size formulation and on the other hand 
from the efficiency of the strengthening valid inequali- 
ties previously described. The poor relaxations of their 
formulation led [1] to propose an alternative and faster 
exact approach to solution of the MPPEP-SNP based on 
the brute-force enumeration of all possible Steiner ver- 
tices necessary to solve a specific instance of the problem. 
To speed up the computation, the brute-force enumer- 
ation algorithm makes use of a set of reduction rules 
based on Buneman graph enumeration to decrease the 
number of Steiner vertices to be considered. Interestingly, 
despite the differences in terms of implementation lan- 
guage between the two programs (namely, Mosel for the 
Flow-RM and C++ for [1] s brute-force enumeration algo- 
rithm), the Flow-RM proved to be competitive with [1] s 
enumeration algorithm, being able to solve almost all the 
considered instances within 1 second computing time. 
Only in two cases, namely Human mtDNA 40 x 52 and 
Human mtDNA 395 x 830, the Flow-RM needed more 
than 5 minutes to find the corresponding optimal solu- 
tions. The deterioration of the runtime performance in 
those instances is mainly due to the overhead necessary 
to load the formulation (that in both cases is considerably 
bigger than in the other instances) and to an intensive use 



Table 1 Comparison between the gap of [1 ]'s polynomial size integer programming model for the MPPEP-SNP versus the 



gap of the flow-based reduced model and its strengthening valid inequalities 


Dataset 


Haplotypes 


SNPs 


GAP (%) 


GAP (%) 


Optimum 


MST 








[1] 


Flow-RM 




Solution 


Human chromosome Y 


150 


49 


0.00 


0.00 


16 


yes 


Bacterial mtDNA 


17 


1510 


26.04 


20.83 


96 


no 


Chimpanzee mtDNA 


24 


1041 


20.63 


20.63 


63 


yes 


Chimpanzee chromosome Y 


15 


98 


0.00 


0.00 


99 


yes 


Human mtDNA 


40 


52 


24.66 


1.37 


73 


no 


Human mtDNA 


395 


830 


22.64 


7.55 


53 


no 


Human mtDNA 


13 


390 


12.50 


6.25 


48 


no 


Human mtDNA 


44 


405 


6.98 


4.65 


43 


no 
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of the separation oracle for the forbidden path constraints. 
Possibly, a more thorough implementation of the separa- 
tion oracle and the use of more performing languages (e.g., 
C++) could help in speeding up computations in those 
instances at least. 

Interestingly, sometimes in real applications the num- 
ber of haplotypes can be much bigger than the number 
of SNPs. Hence, it is important to test the ability of an 
exact algorithm to tackle instances of the MPPEP-SNP 
containing e.g., hundreds haplotypes. [1] observed that 
their brute force enumeration algorithm is able to tackle 
instances of the problem containing up to 270 haplo- 
types having up to 9 SNPs each within 12 hours com- 
puting time. Unfortunately, the authors also observed that 
their algorithm is unable to solve larger instances of the 
MPPEP-SNP, no matter the maximum runtime consid- 
ered. In this context, the Flow-RM makes the difference, 
being able to tackle instances of the MPPEP-SNP hav- 
ing up to 300 haplotypes and 10 SNPs within 3 hours 
computing time. To show this result, we considered a 
number of random instances of the problem containing 
100, 150, 200, 250, and 300 haplotypes, respectively. Fixing 
the number of haplotypes n G {100, 150, 200, 250, 300}, we 
created an instance of the problem by generating at ran- 
dom n strings of length 10 over the alphabet E = {0, 1}. 
During the generation process, we randomly selected the 
number of SNPs equal to 1 in a given haplotype, and sub- 
sequently we randomly chose the sites of the haplotype to 
be set to 1. We iterated the instance generation process 10 
times for a fixed value of obtaining eventually an over- 
all number of 50 random instances of the MPPEP-SNP 
downloadable at http://homepages.ulb.ac.be/~dacatanz/ 
Site/Software_files/iMPPERzip. 

The results obtained in our experiments are shown in 
Table 2. Specifically, the column "Time" refers to the solu- 
tion time (expressed in seconds) necessary to solve exactly 
a specific instance of the MPPEP-SNP. Analogously, the 
column "Nodes" refers to the number of explored nodes in 
the search tree needed to solve exactly the instance. The 
table does not report on the performance of [l]'s enumera- 
tion algorithm, as their algorithm never found the optimal 
solution to the analyzed instances within the limit runtime 
of 3 hours. As a general trend, the table shows that the 
considered instances can be exactly solved within 1 hour 
computing time. The only exceptions are constituted by 
the 7th instance of the group 150 x 10, the 9th instance of 
the group 200 x 10, the 2th instance of the group 250 x 10, 
and 3th instance of the group 300 x 10 which needed 
8719.65, 4600.69, 7757.98, and 5371.05 seconds, respec- 
tively, to be solved. These instances are much more sparse 
than the others, are characterized by smaller reduction 
ratios, and tend to have more degenerate relaxations than 
the other instances. The presence of these factors might 
explain the loss of performance of the Flow-RM. 



The results showed that the integrality gaps are usually 
very low, ranging from 0% to 4.63% and assuming in aver- 
age a value about 1%, confirming once again the tightness 
of the Flow-RM and the efficiency of the strengthening 
valid inequalities. 

Finally, we also tested the performance of the Flow- 
RM on a set of 5 HapMap Human mitochondrial DNA 
instances of the MPPEP-SNP that were not solvable by 
using [l]'s brute-force enumeration algorithm, namely: fl 
constituted by 63 haplotypes having 16569 SNPs each, 12 
constituted by 40 haplotypes having 977 SNPs each, k3 
constituted by 100 haplotypes having 757 SNPs each, m4 
constituted by 26 haplotypes having 48 SNPs each, and 
p5 constituted by 21 haplotypes having 16548 SNPs each. 
Such instances can be downloaded at the same address 
and consist only of non-recombining data (Y chromo- 
some, mitochondrial, and bacterial DNA). 

A part from m4, all the remaining instances gave rise 
to too large formulations (several hundreds Mbytes RAM) 
to be handled by the Xpress Optimizer. Hence, instead of 
analyzing entirely each instance we decomposed it into 
contiguous SNP blocks and analyzed the most difficult 
block separately. In more in detail, we define Hr to be 
the haplotype matrix obtained by the application of [l]'s 
reduction rules, we sorted the columns of Hr according 
to an increasing ordering of the weights w^, s e S; subse- 
quently, we considered the submatrices obtained by taking 
k contiguous SNPs (or /c-block) in 5, k e {10, 13, 15}. We 
did not consider greater values for k as we observed that 
k = 15 was already a threshold after which the haplo- 
type submatrix gave rise to too large formulations. For 
each /c-block B in Hr we considered the hamming distance 
dhihj — ^ses — hj(s)\ between each pair of distinct 
haplotypes in Hr, and chose the /c-block maximizing the 
sum J2hi,h enrM<h- ^hihj- Finally, we assumed three hours 
as maximum runtime per instance. 

Table 3 shows the results obtained in our experiments. 
As for Table 2, the columns "Time" and "Nodes" refer to 
the solution time (expressed in seconds) and to the num- 
ber of nodes in the search tree necessary to solve exactly a 
specific instance of the MPPEP-SNP, respectively. In such 
a case, the values in the columns "Gap" refers to the gap 
between the best primal bound found within the limit 
time and the root relaxation and "Nodes" refers to the 
number of nodes explored in the tree search within the 
limit time. 

Table 3 shows that, apart from the instances fl and 
m4, the Flow-RM was unable to exactly solve, within the 
limit time, the considered instances for values of k e 
{13, 15}. Specifically, The Flow-RM exactly solved in less 
than a minute the instance fl when considering values of 
k e {10, 13} ; in 20 minutes the instance 12 when con- 
sidering k = 10 ; in less than 3 minutes the instance 
k3 when considering k = 10 ; and the instance m4 in 5 
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Table 2 Performances of the Flow-RM on a set of random instances of the MPPEP-SNP 



\n\ 


Instance 


l-HI post 


Time 


GAP (%) 


Nodes 


\n\ 


Instance 


\n\ post 


Time 


GAP (%) 


Nodes 






reduction 


(sec.) 










reduction 


(sec.) 






100 


1 


57 


520.05 


0 


1807 


150 


1 


82 


284.51 


0 


424 




2 


60 


59.74 


0 


174 




2 


83 


314.27 


0.76 


56 




3 


63 


377.75 


1.45 


110 




3 


81 


799.01 


0 


67 




4 


61 


2491.62 


3.81 


3351 




4 


67 


1809.26 


2.66 


6617 




5 


60 


2918.09 


4.63 


2062 




5 


79 


1001.14 


2.29 


187 




6 


57 


349.54 


1.59 


264 




6 


74 


1976.73 


2.41 


1071 




7 


65 


258.53 


1.90 


85 




7 


73 


8719.65 


3.92 


4814 




8 


58 


293.97 


0 


1299 




8 


83 


3497.73 


2.17 


421 




9 


62 


862.48 


2.85 


540 




9 


72 


1 1 54.77 


2.51 


410 




10 


64 


87.19 


0 


92 




10 


80 


399.89 


1.56 


256 


200 


1 


99 


614.86 


0 


72 


250 


1 


117 


1155.41 


0 


197 




2 


99 


1353.16 


1.28 


149 




2 


109 


7757.98 


1.72 


1596 




3 


96 


896.68 


0.67 


226 




3 


117 


387.141 


0.84 


180 




4 


104 


652.44 


0.47 


150 




4 


126 


1267.77 


0.51 


114 




5 


96 


382.83 


0 


56 




5 


116 


188.188 


0.84 


162 




6 


106 


2535.09 


0.60 


71 




6 


116 


2311.61 


1.14 


685 




7 


100 


233.50 


0 


21 




7 


116 


1256.24 


0 


265 




8 


99 


1650.17 


0.96 


79 




8 


124 


67.556 


0 


528 




9 


87 


4600.69 


2.10 


954 




9 


122 


2000.77 


0.53 


107 




10 


102 


2554.84 


1.23 


1965 




10 


111 


1200.89 


0.87 


272 


300 


1 


133 


297.19 


0 


15 
















2 


123 


2753.53 


0.39 


68 
















3 


142 


5371.05 


0 


941 
















4 


133 


420.72 


0 


43 
















5 


126 


388.99 


0 


433 
















6 


134 


397.01 


0 


61 
















7 


138 


1173.65 


0 


1788 
















8 


126 


666.21 


0 


186 
















8 


127 


449.30 


0.77 


42 
















10 


145 


201.87 


0 


876 















seconds. In contrast, the Flow-RM was unable to solve the 
instance p5, regardless of the value of k considered. In fact, 
already when considering k = 10, the Xpress Optimizer 
took more than 12 hours to exactly solve the instance p5 
and explored over 10 million nodes in the search tree. A 
more detailed analysis of the instance showed that, despite 
the presence of the strengthening valid inequalities, p5 
is characterized by highly fractional relaxations. This fact 
implies the existence of equivalent optimal solutions to 
the instance that, on the one hand, delay the finding of 
a primal bound and, on the other hand, force the Opti- 
mizer to explore many more nodes in the tree search. 
This situation in more pronounced in p5 but also occurs 
in the instances i2 and k3. To improve the tightness of 



the formulation we tried to include in the Flow-RM also 
classical facets and strengthening valid inequalities for the 
Steiner tree problem in a graph (see [23,36-38]). However, 
we did not observe any benefit from the inclusion. We sus- 
pect that the presence of highly fractional solutions to the 
problem could be caused both by poor lower bounds on 
the number of Steiner vertices considered in the Flow-RM 
and by the existence of a number of non trivial classes of 
symmetries still present in the problem. Investigating such 
issues warrants future research efforts. 

In order to measure the performance of the model 
on multi-state character data we also considered [2] set 
of instances of the MPPEP-SNR Specifically, we con- 
sidered the following instances: a set of 41 sequences 
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Table 3 Performances of the Flow-RM on a set of real instances of the MPPEP-SNP 



Dataset 


Haplotypes 


SNPs 


Block Size 


Time (sec.) 


GAP (%) 


Nodes 


MST Solution 


f1 


63 


16569 


10 
13 

15 


1 

56 
10286.1 


1 

3.11 

26.92 


1 

1 

773521 


yes 
no 
no 


i2 


40 


977 


10 


781.85 


20.00 


3751 1 


no 


k3 


100 


757 


10 
13 


150 
588.38 


7.65 
14.29 


353 
11265 


no 
no 


m4 


26 


48 


10 


5 


5.88 
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no 


P5 


21 


16548 


10 


22283.4 


50.79 


6125448 


no 



of O.rufipogon DNA (red rice) having 1043 sites each; 
80 human mtDNA sequences having 245 sites each; 50 
HIV-1 reverse transcriptase amino acid sequences hav- 
ing 176 sites each; a set of 500 sequences of mtDNA 
from the NCBI BLASTN best aligned taxa having 3000 
sites each; a set of 500 sequences of mtDNA from the 
NCBI BLASTN best aligned taxa having 5000 sites each; 
and a set of 500 sequences of mtDNA from the NCBI 
BLASTN best aligned taxa having 10000 sites each. When 
running the same experiments described in [2] we regis- 
tered a very poor performance for the Flow-RM, mainly 
due to the large dimension of the considered instances 
and the presence of symmetries despite the use of con- 
straints (13)- (15). We observed that the combination of 
these two factors increased the runtime performance of 
the Flow-RM of 2-3 orders of magnitude with respect to 
[2] approach. However, we also observed that when per- 
forming [2]'s "window analysis" (i.e., when decomposing 
into blocks of 10 SNPs the input matrix) the Flow-RM 
performed better than [2] s, being characterized by an 
average solution time of 8.27 seconds. This fact suggests 
that, when considering instances constituted by less than 
a dozen sites, an exact approach entirely based on inte- 
ger programming may perform better than the implicit 
enumeration of the vertices of the generalized Buneman 
graph. Vice-versa, for larger instances the implicit enu- 
meration of the vertices of the generalized Buneman graph 
appears more suitable. 

Conclusion 

In this article we investigated the Most Parsimonious 
Phylogeny Estimation Problem from Single Nucleotide 
Polymorphism (SNP) haplotypes (MPPEP-SNP), a recent 
version of the phylogeny estimation problem that arises 
when input data consist of SNPs extracted from a given 
population of individuals. The MPPEP-SNP is ATP-hard 
and this fact has justified the development of exact and 
approximate solution approaches such as those described 
in [1,19,22,26-28]. We explored the prospects for improv- 
ing on the strategy of [1,2] using a novel problem 
formulation and a series of additional constraints to more 



precisely bound the solution space and accelerate implicit 
enumeration of possible optimal phylogenies. We present 
a formulation for the problem based on an adaptation of 
[23] s mixed integer formulation for the Steiner tree prob- 
lem extended with a number of preprocessing techniques 
and reduction rules to further decrease its size. We then 
show that it is possible to exploit the high symmetry inher- 
ent in most instances of the problem, through a series 
of strengthening valid inequalities, to eliminate redun- 
dant solutions and reduce the practical search space. We 
demonstrate through a series of empirical tests on real and 
artificial data that these novel insights into the symmetry 
of the problem often leads to significant reductions in the 
gap between the optimal solution and its non-integral lin- 
ear programming bound relative to the prior art as well 
as often substantially faster processing of moderately hard 
problem instances. More generally, the work provides an 
indication of the conditions under which such an optimal 
enumeration approach is likely to be feasible, suggesting 
that these strategies are usable for relatively large numbers 
of taxa, although with stricter limits on numbers of vari- 
able sites. The work thus provides methodology suitable 
for provably optimal solution of some harder instances 
that resist all prior approaches. Our results may pro- 
vide also useful guidance for strategies and prospects of 
similar optimization methods for other variants of phy- 
logeny inference. In fact, if appropriately adapted, some 
of the results we presented here (e.g., symmetry breaking 
strategies) can be generalized with respect to other phy- 
logenetic estimation criteria (e.g., the likelihood criterion) 
and provide important computational benefits. 
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